Rows: 1,338
Columns: 7
$ age <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1…
$ sex <chr> "female", "male", "male", "male", "male", "female", "female",…
$ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74…
$ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0…
$ smoker <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ region <chr> "southwest", "southeast", "southeast", "northwest", "northwes…
$ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,…
Number of observations
[1] 1338
Number of variables
[1] 7
---
title: "Assignment 7"
output:
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: default
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
```{r setup, include=FALSE}
library(tidyverse)
insurance <- read_csv("insurance.csv")
```
Overall Analysis
===
Column {data-width=450}
---
### Glimpse of Insurance
```{r}
glimpse(insurance)
```
Column {data-width=550}
---
Brief Analysis
In Assignment 6, we were given a CSV file named insurance.csv. The data set has 1,338 observations and 7 variables: age, sex, bmi, children, smoker, region, and charges. We were asked to compare the variables with each other and show the comparisons in different graphs. The main question throughout is how each variable relates to charges.
Question 1 and Question 2
===
Column {data-width=500}
---
### Read the data file insurance.csv using the read_csv() function in tidyverse.
```{r}
insurance <- read_csv("insurance.csv")
glimpse(insurance)
```
Column {.tabset data-width=500}
---
### Get a glimpse of the data
```{r}
insurance <- read_csv("insurance.csv")
glimpse(insurance)
```
### Number of observations and variables in the data.
Number of observations
```{r}
num_observations <- nrow(insurance)
print(num_observations)
```
Number of variables
```{r}
num_variables <- ncol(insurance)
print(num_variables)
```
Question 3
===
Column {data-width=700}
---
### Create a bar plot of region.
```{r bargraph}
ggplot(data = insurance, aes(x = region)) +
  geom_bar(fill = "skyblue") +
  labs(title = "Distribution of Beneficiaries by Region", x = "Region", y = "Amount")
```
Column {data-width=300}
---
### Summarize your findings
The beneficiaries are spread fairly evenly across the four regions. However, the southeast has the most and the northeast has the least. All of the regions have over 300 beneficiaries.
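The counts behind a summary like this can be checked with a frequency table. The sketch below is only an illustration: `toy_region` is a small made-up vector standing in for `insurance$region`, so its numbers are not the real counts.

```r
# Hypothetical stand-in for insurance$region (made-up values, not the real data)
toy_region <- c("southeast", "southeast", "southeast",
                "southwest", "southwest",
                "northwest", "northwest",
                "northeast")

# Count beneficiaries per region, largest first
region_counts <- sort(table(toy_region), decreasing = TRUE)

# Convert the counts to shares of the total
region_share <- prop.table(region_counts)

print(region_counts)
print(round(region_share, 3))
```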
Question 4 and Question 5
===
Column {data-width=500}
---
### Question 4
Create a stacked bar plot with region on the x-axis, where each bar shows the distribution of smoker in that region.
```{r stacked bar plot}
# Count beneficiaries in each region/smoker combination
region_smoker <- insurance %>%
  group_by(region, smoker) %>%
  summarise(count = n(), .groups = "drop")
# Stack the smoker counts within each region's bar
ggplot(data = region_smoker, aes(x = region, y = count, fill = smoker)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Distribution of Smokers by Region", x = "Region", y = "Amount", fill = "Smoker Status") +
  scale_fill_manual(values = c("blue", "red"), labels = c("Non-Smoker", "Smoker"))
```
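The stacked bars show counts, which makes it hard to compare the share of smokers between regions of different sizes. One possible follow-up, sketched here on a made-up data frame `toy` (a hypothetical stand-in for the `region` and `smoker` columns), is a row-wise proportion table.

```r
# Hypothetical stand-in for the region and smoker columns (made-up values)
toy <- data.frame(
  region = c("northeast", "northeast",
             "southeast", "southeast", "southeast", "southeast"),
  smoker = c("yes", "no", "yes", "no", "no", "no")
)

# Cross-tabulate region by smoker status (the counts a stacked bar shows)
counts <- table(toy$region, toy$smoker)

# margin = 1 divides each row by its row total,
# giving the share of each smoker status within a region
share <- prop.table(counts, margin = 1)

print(round(share, 2))
```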
Column {.tabset data-width=500}
---
### Question 5 Histogram of bmi
```{r histogram1}
ggplot(data = insurance, aes(x = bmi)) +
  geom_histogram(fill = "lightgreen", color = "yellow", bins = 20) +
  labs(title = "Distribution of BMI", x = "BMI", y = "Frequency")
```
### Discuss the distribution of the histogram
The histogram is centered at a BMI of approximately 30. The left side rises more steeply, while the right side falls off more gradually.
Question 6
===
Column {data-width=700}
---
### Create a histogram of charges
```{r histogram 2}
ggplot(data = insurance, aes(x = charges)) +
  geom_histogram(fill = "lightblue", color = "black", bins = 20) +
  labs(title = "Distribution of Charges", x = "Charges", y = "Frequency")
```
Column {data-width=300}
---
### Discuss the distribution of the histogram
The histogram has a steep left side and a more gradual right side. The highest point is the second bin in the graph. The right side then flattens into a long, fairly consistent tail, which gives the distribution a right skew.
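A right skew can also be checked numerically: the mean sits above the median, and a moment-based skewness value is positive. This is a minimal sketch under assumptions: `toy_charges` is a made-up vector standing in for `insurance$charges`, and `skewness()` is a helper defined here, not a function from the tidyverse.

```r
# Hypothetical stand-in for insurance$charges (made-up values)
toy_charges <- c(1200, 1800, 2500, 3100, 4000, 9500, 21000, 42000)

# Moment-based sample skewness (helper defined for this sketch):
# positive values indicate a longer right tail
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

print(mean(toy_charges))      # pulled upward by the few large charges
print(median(toy_charges))    # less affected by the tail
print(skewness(toy_charges))  # positive for right-skewed data
```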
Question 7
===
Column {data-width=700}
---
### Boxplot that shows the distribution of bmi based on the region
```{r boxplot1}
ggplot(data = insurance, aes(x = region, y = bmi, fill = region)) +
  geom_boxplot() +
  labs(title = "Distribution of BMI by Region", x = "Region", y = "BMI")
```
Column {data-width=300}
---
### Discuss what you find based on the boxplot
The northeast and northwest distributions are nearly identical. The southwest sits just above those two in its median and quartiles. The southeast is above all of them, with the highest median, Q1, Q3, maximum, and minimum.
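The values read off a boxplot can be tabulated directly. The sketch below uses a made-up data frame `toy` in place of `insurance`, so the numbers are illustrative only. Note that `geom_boxplot()` whiskers stop at 1.5 times the IQR by default, so the minimum and maximum here can differ from the whisker ends when a group has outliers.

```r
# Hypothetical stand-in for the region and bmi columns (made-up values)
toy <- data.frame(
  region = rep(c("northeast", "southeast"), each = 5),
  bmi = c(22, 25, 28, 31, 34,
          26, 30, 33, 36, 40)
)

# Split bmi into one vector per region
bmi_by_region <- split(toy$bmi, toy$region)

# Minimum, Q1, median, Q3, and maximum for each region (one row per region)
bmi_summary <- t(sapply(bmi_by_region, quantile,
                        probs = c(0, 0.25, 0.5, 0.75, 1)))

print(bmi_summary)
```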
Question 8
===
Column {data-width=700}
---
### Scatterplot between age and charges
```{r scatterplot1}
ggplot(data = insurance, aes(x = age, y = charges)) +
  geom_point(color = "blue") +
  labs(title = "Relationship between Age and Charges", x = "Age", y = "Charges")
```
Column {data-width=300}
---
### Comments on the scatterplot
Overall, there is a positive relationship between age and charges. Most of the data falls along three roughly linear bands, each with a different y-intercept.
Question 9 and Question 10
===
Column {.tabset data-width=500}
---
### Scatterplot of age and charges with smoker as a categorical variable
```{r scatterplot2}
ggplot(data = insurance, aes(x = age, y = charges, color = smoker)) +
  geom_point() +
  labs(title = "Relationship between Age, Charges, and Smoking", x = "Age", y = "Charges")
```
### Comments on the Scatterplot
There is a positive relationship between age and charges for both groups. However, nonsmokers generally have lower charges than smokers of the same age.
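Both observations can be backed with simple group summaries. The sketch below is an illustration on a made-up data frame `toy` (hypothetical values, not the insurance data): it compares the median charges of the two groups and computes the age and charges correlation within each group.

```r
# Hypothetical stand-in for the age, smoker, and charges columns (made-up values)
toy <- data.frame(
  age = rep(c(20, 30, 40, 50, 60), times = 2),
  smoker = rep(c("no", "yes"), each = 5),
  charges = c(2000, 4000, 6500, 9000, 12000,
              16000, 20000, 33000, 38000, 45000)
)

# Median charges for nonsmokers and smokers
median_by_smoker <- tapply(toy$charges, toy$smoker, median)

# Correlation between age and charges within each group
cor_by_smoker <- sapply(split(toy, toy$smoker),
                        function(d) cor(d$age, d$charges))

print(median_by_smoker)
print(round(cor_by_smoker, 3))
```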
Column {data-width=500}
---
### Question 10
Create two data frames by subsetting insurance data as smoker and nonsmoker.
```{r, echo=TRUE}
smoker <- insurance[insurance$smoker == "yes", ]
nonsmoker <- insurance[insurance$smoker == "no", ]
```
Question 11
===
Column {data-width=700}
---
### Scatterplot of age vs charges for smokers
```{r scatterplot3}
ggplot(data = smoker, aes(x = age, y = charges)) +
  geom_point(color = "red") +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship between Age and Charges (Smokers)", x = "Age", y = "Charges")
```
Column {data-width=300}
---
### Comments on the plot and smooth line summary
The smooth line runs through the middle of the data points, but few of the points are close to it. Most are above or below it by at least $5000. The line captures the overall upward trend, but because the points sit so far from it, I do not think the line is a good way to summarize the data.
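How far the points sit from a fitted line can be quantified with the residuals of the same linear model that `geom_smooth(method = "lm")` draws. This is a minimal sketch on a made-up data frame `toy_smoker` (hypothetical values), using the $5000 distance mentioned above as the cutoff.

```r
# Hypothetical stand-in for the smoker subset (made-up values)
toy_smoker <- data.frame(
  age = c(19, 25, 31, 38, 44, 50, 57, 63),
  charges = c(17000, 34000, 19500, 39000, 22000, 42000, 27000, 47000)
)

# Fit charges as a linear function of age
fit <- lm(charges ~ age, data = toy_smoker)

# Share of the variation in charges that the line explains
r_squared <- summary(fit)$r.squared

# Share of points more than $5000 above or below the line
far_share <- mean(abs(residuals(fit)) > 5000)

print(r_squared)
print(far_share)
```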
Question 12
===
Column {data-width=700}
---
### Scatterplot of age vs charges for non-smokers
```{r scatterplot4}
ggplot(data = nonsmoker, aes(x = age, y = charges)) +
  geom_point(color = "green") +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship between Age and Charges (Non-Smokers)", x = "Age", y = "Charges")
```
Column {data-width=300}
---
### Comments on the plot and smooth line summary
The line is much closer to the data points. The points below the line are nearly on it, but the points above it are farther away and more scattered. In Assignment 6, I said the line was a good representation. On reviewing it again, however, I do not like the line as a summary.
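The uneven spread around the line can be checked by comparing the residuals above and below it. The sketch below is an illustration on a made-up data frame `toy_nonsmoker` (hypothetical values chosen to mimic a few high points above a tight band), not a result from the insurance data.

```r
# Hypothetical stand-in for the nonsmoker subset (made-up values)
toy_nonsmoker <- data.frame(
  age = c(20, 25, 30, 35, 40, 45, 50, 55, 60),
  charges = c(2000, 3000, 4200, 5300, 18000, 7800, 9000, 24000, 11500)
)

# Fit charges as a linear function of age
fit <- lm(charges ~ age, data = toy_nonsmoker)
res <- residuals(fit)

# Average distance from the line for points above it and below it
mean_above <- mean(res[res > 0])
mean_below <- mean(abs(res[res < 0]))

# Number of points on each side of the line
n_above <- sum(res > 0)
n_below <- sum(res < 0)

print(c(mean_above = mean_above, mean_below = mean_below))
print(c(n_above = n_above, n_below = n_below))
```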
Question 13 and Question 14
===
Column {data-width=500}
---
### What you might do next if you want to model charges using other variables in this data
The next step would be to fit models that include the other variables, such as BMI, number of children, and region. This would build toward a more comprehensive model for predicting medical charges.
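One way this could look is a multiple linear regression. This is a sketch under assumptions: the assignment does not specify a model type, so `lm()` is only one reasonable first attempt, and `toy` is simulated data with a made-up relationship, not the insurance data.

```r
set.seed(1)
n <- 40

# Simulated stand-in for the insurance data (hypothetical values)
toy <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  bmi = round(runif(n, 18, 40), 1),
  children = rep(0:2, length.out = n),
  smoker = rep(c("no", "no", "no", "yes", "yes"), length.out = n),
  region = rep(c("northeast", "northwest", "southeast", "southwest"),
               length.out = n)
)

# Made-up relationship used only to generate charges for the sketch
toy$charges <- 2000 + 250 * toy$age + 300 * toy$bmi +
  20000 * (toy$smoker == "yes") + rnorm(n, sd = 3000)

# Model charges using several predictors at once
fit <- lm(charges ~ age + bmi + children + smoker + region, data = toy)

print(coef(fit))
print(summary(fit)$r.squared)
```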
Column {.tabset data-width=500}
---
### Pie chart of children
```{r pie}
insurance %>%
  count(children) %>%
  ggplot(aes(x = "", y = n, fill = as.factor(children))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of Beneficiaries by Number of Children", fill = "Number of Children")
```
### Summarize your findings
Most beneficiaries have 0 or 1 child. The next most common count is 2, then 3, and 5 children is the least common. Almost half of the beneficiaries have no children.
Question 15
===
Column {data-width=500}
---
### Boxplot that shows the distribution of charges based on the number of children
```{r boxplot2}
ggplot(data = insurance, aes(x = as.factor(children), y = charges)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Distribution of Charges by Number of Children", x = "Number of Children", y = "Charges")
```
Column {data-width=500}
---
### Discuss what you find based on the boxplot
The median charge is more or less constant regardless of the number of children. The maximum charge belongs to a beneficiary with 0 children. Beneficiaries with 2 children have the highest Q3. Beneficiaries with 1 child have the lowest Q1. Beneficiaries with 4 children have the highest median.